# **Hardware Accelerator for Raptor Decoder** Todor Mladenov, Saeid Nooshabadi and Kiseon Kim Abstract – Hard Raptor Codes (designed for erasure channels) are widely used for mobile multimedia content delivery, and yet they have not been investigated in the context of embedded systems where the energy dissipation is as important as the timing performance. The most time consuming part of Raptor decoder is the matrix inversion operation. This paper proposes a hardware accelerator, for two matrix inversion algorithms, as a part of Raptor decoder implemented on a system on a chip (SoC) platform with a soft-core embedded processor. The performance, energy profile and resource implication are analyzed and compared with a pure software implementation. Keywords - Raptor Codes, Matrix Inversion, FPGA ## I. INTRODUCTION Multimedia on mobile devices requires secure delivery of various sized data with minimum negotiation overhead. Here is where Raptor codes [1],[2] have come quite useful and outperformed the already well known coding schemes. Recently there have been two standards, namely 3GPP MBMS (Multimedia Broadcast/Multicast Services) [3] and DVB-H [4], which have included systematic Raptor codes in their specifications for content delivery. Although Raptor codes are growing as a preferred mobile multimedia delivery scheme, experimental data relating to their implementations are reported from simulation on a workstation platform. As far as we are aware, their hardware implementations for mobile embedded systems have not yet been investigated. This paper looks at the implementation of Raptor codes on an embedded system platform, where resources in terms of computational and power dissipation are limited. The most demanding part of the Raptor decoder, profiled to take 92% of the decoding time, is the matrix inversion operation. This motivates us to look for a hardware implementation of the matrix inversion that would reduce both the decoding time and the energy dissipation. We propose such dedicated hardware blocks for the well known Gaussian elimination (GE) algorithm and the efficient matrix inversion algorithm (SA) proposed in [3],[4]. The relative performance of these two algorithms in terms of decoding time, power, energy and area trade offs are demonstrated. We propose and design hardware enhancements for GE and SA based on their algorithmic structures. Finally, based on the profiling data, suitability assessments are made for the implementation of GE and SA on an embedded system platform. The chosen embedded system platform is a NIOS soft-core processor, running on an EP1S40F780C5 Altera Stratix FPGA, with 41,250 logical elements, 3,423,744 total memory bits (2,097,152 bits maximum single memory size), 14 DSP blocks and 129 (9-bit) embedded multipliers. T. Mladenov, S. Nooshabadi and K. Kim are with the Department of Information and Communications, Gwangju Institute of Science and Technology, Gwangju, South Korea, e-mails: {todor,saeid,kskim@gist.ac.kr} Fig.1. Raptor Codes on hardware/software NIOS embedded system This device is housed on the NIOS Development Board Stratix Professional Edition with 16MB of SDRAM memory. NIOS soft-core processor can be augmented with custom instructions and additional peripheral devices. Fig.1 depicts the high level block diagram of the embedded SoC platform for the implementation of Raptor codes. This paper is organized as follows. In Section II, we briefly explain the operation of a Raptor decoder. Section III describes and presents the details of GE and SA, the two algorithms for the matrix inversion used in this paper. Section IV presents the hardware implementation performance results in terms of execution time, power and energy, and hardware resources and compares them to the software implementation presented in [5]. Section V concludes the paper. ## II. SYSTEMATIC RAPTOR DECODER A Raptor code can be viewed as a regular linear block code, which makes it possible to be represented by a generator matrix. A block diagram of systematic Raptor encoder and decoder is shown in Fig.2. The decoding process of Raptor codes exchanges the positions of the Code Constraints Processor and the LT Encoder(Decoder) with the proper dimensions for the G<sub>LT</sub> LT generator matrices. The output vector $\mathbf{e}$ , containing N symbols, generated by the encoder is received by the decoder across the channel as input vector $\mathbf{e}'$ , containing N' ( $K \le N' \le N$ ) encoded symbols (which may be nonconsecutive, where K is the number of source symbols). Vector e' is padded with S+H zeroes to dimension it to (M=N'+S+H). Starting with (N'-K) the value of N' is iteratively incremented to make the Code Constraints Preprocessor matrix A invertible. The difference (N'-K) is equal to or greater than the number of received encoded symbols lost in the channel. The decoding is performed according to (1) and (2), where $G_{LT}$ is a LT generator matrix with dimension of $K \times L$ . All operations are performed in Galois Field GF(2). Fig.2. Block diagram of the systematic Raptor Codes $$c_{[0:L-1]} = [z^T e^{T}] \cdot A_{M \times L}^{-1}$$ (1) $$t_{[0:K-1]} = G_{LT} \cdot c_{[0:L-1]} \tag{2}$$ At the decoder side the submatrix $G_{LT}$ (1..N') is first built from the input data. The sequence number of the n<sup>th</sup> received encoded symbol is used to generate the n<sup>th</sup> row of the submatrix $G_{LT}$ (1..N') through the LT encoding process. ### III. MATRIX INVERSION ALGORITHMS The most common matrix inversion algorithm is GE [6]. The pseudo code for a GF(2) GE algorithm where elimination and backward substitution are performed together, is shown in Algorithm 1 .The main operations involved in this algorithm are "row exchange" and "row XOR" (Exclusive-OR). The specifications in [3] and [4] recommend SA as a more efficient technique for matrix inversion. A version of SA technique is presented in Algorithm 2. The operation of SA is as follows. In *Phase I* matrix **A** is reduced to the following form: $$A_{Phase_{I}(M\times L)} = \begin{bmatrix} I_{i} & U_{M\times u} \\ Z_{(M-i)\times i} & U_{M\times u} \end{bmatrix}$$ (3) This reduction is performed iteratively, by first relocating the rows containing the minimum number of "1s" to the top, and then moving the first column having "1" in this row to the beginning at column location i, and the remaining columns with "1" to the end of the row at column locations m-u-1. Note that i and u are initialized to 0. While, in each row, i increments only once per row, u can increment multiple times. In each iteration of the algorithm one row from the top is excluded from the consideration. Further, the count of "1s" within a row is confined to columns i to (m-u-1). Phase I is completed when (L=i+u). **Algorithm 1** The Gaussian Elimination algorithm (GA) for matrix inversion over GF(2) ``` Require: A \in \{0,1\}^{n \times m} 1: for i = 0 : n - 1 do j = i; 2: while a_{ji} == 0 do 4: j = j + 1; 5: if i \neq j then row_exchange(\vec{a}_i, \vec{a}_j); 6: for (k = 0 : n - 1) \& (k \neq i) do 7: if a_{ki} == 1 then 8: 9: for l = 0 : m - 1 do 10: a_{kl} = a_{kl} \bigoplus a_{il} ``` **Algorithm 2** The efficient matrix inversion algorithm (SA) over GF(2) proposed in [3], [4] ``` Require: A \in \{0,1\}^{n \times m} Ensure: i = 0; u = 0; 1: Phase I 2: while i + u < m do 3: r = 1; 4: while r \leq m do for s = i : n - 1 do 5: \mathbf{if} \ (\mathbf{count\_ones\_\&\_store\_pos}(s) == r) \ \mathbf{then} 6: 7: break; if (row_found) then 8: break; 9: 10: 11: r = r + 1; if (i \neq s) then 12: row_exchange(\vec{a}_i, \vec{a}_s); 13: if (a_{ii} == 0) then 14: col_exchange(i, ones\_col[0]); 15: for h = 1 : r - 1 do 16: col_exchange((m-u-1), ones_col[h]); 17: 18: u = u + 1; for (k = (i + 1): n - 1) do 19: if a_{ki} == 1 then 20: for l = 0 : m - 1 do 21: a_{kl} = a_{kl} \bigoplus a_{il} 22: 23. i = i + 1: 24: Phase II 25: Gaussian Elimination on submatrix S = [i+1:n-1][i+1:m-1] 26: Phase III for (k = 0 : i) do for (s = i + 1 : m - 1) do 28: if a_{ks} == 1 then 29: 30: a_{ks} = a_{ks} \bigoplus a_{ss} ``` In *Phase II* submatrix $\mathbf{U}$ is partitioned into lower and upper submatrices $\mathbf{U'}_{\mathbf{i} \times \mathbf{u}}$ and $\mathbf{U''}_{\mathbf{M} - \mathbf{i} \times \mathbf{u}}$ , respectively. The lower matrix $\mathbf{U''}_{\mathbf{M} - \mathbf{i} \times \mathbf{u}}$ is transformed into the identity matrix $\mathbf{I}_{\mathbf{u}}$ through the normal Gaussian elimination technique. The (M-L) rows that are left below $\mathbf{I}_{\mathbf{u}}$ are discarded. The form of the matrix produced at the end of *Phase II* is: $$A_{Phase_{II}} = \begin{bmatrix} I_i & U_{i \times u} \\ Z_{u \times i} & I_u \end{bmatrix} \tag{4}$$ In *Phase III* the upper matrix U' is zeroed by XOR of its individual rows with the sufficient number of rows from the lower matrix $I_{u}$ . $$A_{Phase_{III}} = \begin{bmatrix} I_i & Z_{i \times u} \\ Z_{u \times i} & I_u \end{bmatrix}$$ (5) It was shown in [5] that for pure software implementation and the "PACKED WORD" memory organization (where 32 matrix elements are packed together into a single 32-bit memory word) the simple GE algorithm outperforms SA by a factor of 20.48. #### IV. HARDWARE IMPLEMENTATION The inversion of the *Code Constraints Processor* matrix in Fig. 2 has been profiled to be the most time consuming part of the Raptor decoder. To reduce this computational bottleneck, in what follows, a dedicated hardware block for matrix inversion is proposed (Dedicated Raptor Code Hardware in Fig.1), and its performance, power, energy dissipation, and the required hardware resources are presented and analyzed. Fig.3 shows the block diagram of the hardware accelerator block. The Avalon switch fabric uses a slave port to set the Control Registers that initialize the hardware accelerator. During the initialization the size and initial addresses for matrix and vectors are set. Control Registers also control operation of the hardware like initiating START and STOP commands. The Hardware accelerator block uses an Avalon master port to access the whole memory mapped address space of the NIOS processor, and send interrupts to NIOS and receive interrupts from other peripheral devices. For a more flexible design Row & Column Exchange Logic, and Row XOR Logic units are designed to be self contained units that interface with the GE/SA Control Logic finite state machine. They share a common address generation path through the Arithmetic Unit that contains two 32-bit adders and one 16-bit multiplier. The Latch circuit only exists in the SA version of the hardware accelerator. ## A. Performance SA algorithm includes a procedure for counting the number of "1s" in the matrix rows (Phase 1, lines 4 to 11) in Algorithm 2. After a memory read access the Avalon switch fabric does not retain the data word just read, but goes into the "high impedance" state until the next read cycle. Since we use the "PACKED WORD" memory organization we need a mechanism to retain the read data to avoid multiple memory accesses and bit masking to acquire all the bits in a row. Fig.3. Hardware Accelerator Block 1) Memory access interface enhancement: The proposed hardware block in Fig. 3 for the SA algorithm employs a special circuit that latches the data word read from the memory into the Latch until the next word is needed. The subsequent memory accesses for the matrix elements are made to this register, as long as the required matrix entries are in the current word. Fig. 4 depicts the performance of the SA algorithm with and without the addition of this circuit for several values of K. This simple enhancement improves the performance by a factor of 5.5. It should be noted that the performance improvement is solely due to the "the count of 1s" operation in *Phase I* of Algorithm 2 for SA where memory accesses are sequential. However, there is no performance improvement in *Phases II and III* of this algorithm, and Algorithm 1 for GE, where "row exchange", "column exchange" and "row XOR" operations involve search for the elements of the matrix from non sequential addresses. At 100 and 35 MHz speeds for the SDRAM, and the logic, the hardware versions of GE and SA perform better than their software versions, as reported in [5], by factors of 1.76 and 20.31, respectively. This is in spite of the fact that the software implementation of the algorithms runs on a 100 MHz NIOS processor; a clock rate almost 3 times faster. It should also be noted that hardware version of GE performs better than the hardware version of SA by a factor of 1.48. This ratio is less than the equivalent 20.48 ratio for Fig.4. Performance of the hardware-enhanced Phase I of SA algorithm Fig.5. Effect of SDRAM memory speed on GE and SA algorithms the software implementations in [5]. The reduction in the performance ratio indicates that the SA algorithm benefits far more that the GE in switching from software to the hardware platform. This is mainly due to the speed up in the "the count of 1s" procedure due to the simple enhancement. 1) Effect of SDRAM and logic clock rates: Fig.5 shows the effect of the SDRAM speed on GE and SA algorithms for (K=1024). The data is plotted for three different logic clocks - 25, 30 and 35 MHz. While both algorithms benefit from a faster SDRAM, the effect of memory speed is more noticeable on the operation of the GE algorithm. This is especially true for the slower SDRAM region where the plots for SA and GE cross over each other. The point of cross over moves to right for the higher logic speed. This indicates that a faster logic requires a faster SDRAM if the relative superior performance of GE algorithm over SA were to be maintained. This highlights the fact that the SA algorithm benefits more from a faster logic. From Fig. 5 it can be inferred that the enhancement in the "the count of 1s" procedure has the following effect. The increase in the logic speed from 25 MHz to 35 MHz is slightly more effective in improving the performance of the SA algorithm. This is because the number of slow memory accesses is significantly reduced due to the enhancement which causes the computation to be more dependent on the speed of the logic. #### B. Power and Area Table 1 shows the power, energy and hardware resource requirements for the FPGA implementation of the NIOS soft-core processor with the proposed dedicated hardware accelerator block for the matrix inversions, with the SDRAM and logic operating at the speeds of 100 and 35 MHz, respectively. The NIOS processor is placed in idle state when the control is passed to the dedicated hardware accelerator. In spite of the addition of a new block, the power dissipation is less for all cases, when compared to the corresponding software implementations in [5]. From the data in Table 1 it can be seen that the energy required for the matrix inversion is 1.44 times less for the GE algorithm compared to the SA. The implementation with the dedicated hardware block for GE needs several times less energy to invert the same matrix compared to entirely software implementation [5]. The saving factor is 3.43. The similar energy saving factor for SA is, impressively, much higher at 22.04. Table 1. Power, energy and resources for the hardware implementation for (K=128) with $100\,\mathrm{MHz}\,\mathrm{SDRAM}$ and $35\,\mathrm{MHz}\,\mathrm{Logic}$ speed. | GE | Power [MW] | 1,564.71 | |-----------|----------------|----------| | SA | POWER [MW] | 1,564.02 | | GE | ENERGY [MJ] | 53.86 | | SA | ENERGY [MJ] | 77.64 | | | | | | HARDWARE | LOGIC ELEMENTS | 8,172 | | RESOURCES | MEMORY BITS | 47,360 | #### V. CONCLUSIONS This paper has evaluated the performance of hard Raptor decoder on embedded system. The performance of two matrix inversion algorithms on dedicated hardware block for embedded system has been presented. It was shown that the hardware implementation achieves better performance with less energy compared to a software implementation in [5]. Furthermore, it was found that contrary to the recommendation, based on the profiling on a workstation platform, even under hardware implementation the Gaussian elimination performs significantly better than the alternative reportedly more efficient algorithm, in terms of execution speed and energy saving. ## **ACKNOWLEDGEMENTS** This work was (in part) supported by the Center for Distributed Sensor Networks at GIST. #### REFERENCES [1] A. Shokrollahi, *Raptor Codes*, IEEE Transactions of Information Theory, vol. 52, pp. 2551-2567, Jun. 2006. [2] M. Luby, M. Watson, T. Gasiba, T. Stockhammer, and W. Xu, *Raptor codes for reliable download delivery in wireless broadcast systems*, Third IEEE Consumer Communications and Networking Conference, vol.1, Jan. 2006, pp.192-197. [3] 3GPP TS 26.346, Technical Specification Group Services and System Aspects; Multimedia Broadcast/Multicast Services (MBMS); Protocols and codecs, 3GPP Technical Specification, Rev. V7.4.1, Jun. 2007. [4] Digital Video Broadcasting (DVB); IP Datacast over DVB-H: Content Delivery Protocols, ETSI Technical Specification, Rev. V1.2.1, 2006. [5] T. Mladenov, S. Nooshabadi, A. Dassatti, and K. Kim, *Analysis and implementation of raptor codes on embedded system*, 2009 IEEE International Conference on Electronics, circuits and systems, (submitted). [6] D. Parkinson and M. Wunderlich, *A compact algorithm for Gaussian elimination over GF(2) implemented on highly parallel computers*, Parallel Computing, vol. 1, pp. 65073, Aug. 1984.